In this exercise we further analyze the Wage data set considered throughout this chapter, performing polynomial regression to predict wage using age and using cross-validation to select the polynomial degree. The code in this chapter relies on the following packages:
library(ISLR)    # Wage and College data sets
library(MASS)    # Boston data set
library(boot)    # cv.glm()
library(splines) # bs()
library(mgcv)    # gam() with s(..., bs = 'cr')
library(leaps)   # regsubsets()
library(visreg)  # visreg()
library(plyr)    # ldply()
library(dplyr)   # filter(), mutate_at()
library(purrr)   # map()
library(ggplot2)
set.seed(321)
cv.poly <- function(i, data, k){
  form <- paste0('wage ~ poly(age, ', i, ')')
  glm.out <- glm(as.formula(form), data = data)
  cv.glm(data, glm.out, K = k)$delta[1]
}
cv.error <- sapply(1:20, cv.poly, data = Wage, k = 10)
models <- function(i, data){
  form <- paste0('wage ~ poly(age, ', i, ')')
  lm.out <- lm(as.formula(form), data = data)
  lm.out
}
mods <- lapply(1:20, models, data = Wage)
print(which.min(cv.error))
## [1] 9
do.call(anova, mods)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Model 8: wage ~ poly(age, 8)
## Model 9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
## Model 11: wage ~ poly(age, 11)
## Model 12: wage ~ poly(age, 12)
## Model 13: wage ~ poly(age, 13)
## Model 14: wage ~ poly(age, 14)
## Model 15: wage ~ poly(age, 15)
## Model 16: wage ~ poly(age, 16)
## Model 17: wage ~ poly(age, 17)
## Model 18: wage ~ poly(age, 18)
## Model 19: wage ~ poly(age, 19)
## Model 20: wage ~ poly(age, 20)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.4720 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8804 0.001687 **
## 4 2995 4771604 1 6070 3.8066 0.051145 .
## 5 2994 4770322 1 1283 0.8043 0.369884
## 6 2993 4766389 1 3932 2.4659 0.116446
## 7 2992 4763834 1 2555 1.6024 0.205660
## 8 2991 4763707 1 127 0.0794 0.778085
## 9 2990 4756703 1 7004 4.3924 0.036183 *
## 10 2989 4756701 1 3 0.0017 0.967562
## 11 2988 4756597 1 103 0.0647 0.799207
## 12 2987 4756591 1 7 0.0043 0.947939
## 13 2986 4756401 1 190 0.1189 0.730307
## 14 2985 4756158 1 243 0.1521 0.696580
## 15 2984 4755364 1 795 0.4983 0.480291
## 16 2983 4754082 1 1282 0.8040 0.369960
## 17 2982 4751750 1 2332 1.4625 0.226633
## 18 2981 4751687 1 63 0.0392 0.842974
## 19 2980 4751141 1 546 0.3422 0.558586
## 20 2979 4750430 1 711 0.4461 0.504223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the 9th-order polynomial fits significantly better than the 8th-order one, but apart from that, orders higher than 4 do not appear to provide better fits than lower-order models.
set.seed(123)
cv.cuts <- function(i, data, k){
  data$tmp <- cut(data$age, i)
  glm.out <- glm(wage ~ tmp, data = data)
  cv.glm(data, glm.out, K = k)$delta[1]
}
cuts <- sapply(2:20, cv.cuts, data = Wage, k = 10)
which.min(cuts)
## [1] 10
cuts.min <- lm(wage ~ cut(age, which.min(cuts) + 1), data = Wage) # candidates were 2:20, so index 10 corresponds to 11 intervals
preds <- predict(cuts.min, se = TRUE)
df <- data.frame(wage = Wage$wage, age = Wage$age,
                 fitted = preds$fit,
                 lower = preds$fit - 2 * preds$se.fit,
                 upper = preds$fit + 2 * preds$se.fit)
ggplot(df, aes(x = age, y = wage)) + geom_point() +
  geom_line(aes(y = fitted), color = 'red') +
  geom_line(aes(y = lower), color = 'red', linetype = 2) +
  geom_line(aes(y = upper), color = 'red', linetype = 2)
The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.
We fit a GAM with smooth terms for age and year and parametric terms for race and education.
gam.out1 <- gam(wage ~ s(age, bs = 'cr') + race + education + s(year, k = 4, bs = 'cr'), data = Wage)
summary(gam.out1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## wage ~ s(age, bs = "cr") + race + education + s(year, k = 4,
## bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.648 2.185 39.655 < 2e-16 ***
## race2. Black -6.162 2.188 -2.816 0.00489 **
## race3. Asian -3.441 2.684 -1.282 0.19989
## race4. Other -7.614 5.848 -1.302 0.19302
## education2. HS Grad 10.649 2.431 4.380 1.23e-05 ***
## education3. Some College 23.390 2.562 9.129 < 2e-16 ***
## education4. College Grad 37.721 2.553 14.777 < 2e-16 ***
## education5. Advanced Degree 62.269 2.774 22.451 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 4.888 5.968 37.29 < 2e-16 ***
## s(year) 1.200 1.369 10.08 0.000372 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.292 Deviance explained = 29.5%
## GCV = 1238.4 Scale est. = 1232.6 n = 3000
plots <- visreg(gam.out1, plot = FALSE)
df <- ldply(plots, function(part)
  data.frame(variable = part$meta$x,
             x = part$fit[[part$meta$x]],
             smooth = part$fit$visregFit,
             lower = part$fit$visregLwr,
             upper = part$fit$visregUpr))
df %>% dplyr::filter(variable == 'age') %>%
  ggplot(aes(x, smooth)) + geom_line() +
  geom_line(aes(y = lower), linetype = "dashed") +
  geom_line(aes(y = upper), linetype = "dashed") +
  facet_grid(. ~ variable, scales = "free_x")
There is a non-linear relationship between age and wage.
This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.
We use the poly() function to fit a cubic polynomial regression to predict nox using dis, report the regression output, and plot the resulting data and polynomial fit.
lm.out <- lm(nox ~ poly(dis, 3, raw = TRUE), data = Boston)
summary(lm.out)
##
## Call:
## lm(formula = nox ~ poly(dis, 3, raw = TRUE), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9341281 0.0207076 45.110 < 2e-16 ***
## poly(dis, 3, raw = TRUE)1 -0.1820817 0.0146973 -12.389 < 2e-16 ***
## poly(dis, 3, raw = TRUE)2 0.0219277 0.0029329 7.476 3.43e-13 ***
## poly(dis, 3, raw = TRUE)3 -0.0008850 0.0001727 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
ggplot(Boston, aes(y = nox, x = dis)) + geom_point() + geom_smooth(method = 'lm', formula = 'y ~ poly(x, 3, raw = TRUE)')
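The call below uses a helper get_rss() that is not defined in this excerpt. A minimal sketch of it, assuming (from the surrounding question) that it fits a raw polynomial of degree i for nox on dis, optionally plots the fit, and returns the residual sum of squares:

```r
# Hypothetical reconstruction of the helper used below: fit a degree-i
# polynomial of nox on dis and return the residual sum of squares.
library(MASS)  # Boston data

get_rss <- function(i, data, plots = TRUE){
  fit <- lm(nox ~ poly(dis, i, raw = TRUE), data = data)
  if (plots) {
    plot(data$dis, data$nox, xlab = 'dis', ylab = 'nox')
    ord <- order(data$dis)
    lines(data$dis[ord], fitted(fit)[ord], col = 'red')
  }
  sum(resid(fit)^2)
}
```

For i = 3 this returns the RSS of the cubic fit reported above (about 1.934).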
map(1:10, get_rss, data = Boston, plots = FALSE)
## [[1]]
## [1] 2.768563
##
## [[2]]
## [1] 2.035262
##
## [[3]]
## [1] 1.934107
##
## [[4]]
## [1] 1.932981
##
## [[5]]
## [1] 1.91529
##
## [[6]]
## [1] 1.878257
##
## [[7]]
## [1] 1.849484
##
## [[8]]
## [1] 1.83563
##
## [[9]]
## [1] 1.833331
##
## [[10]]
## [1] 1.832171
set.seed(321)
cv.poly <- function(i, data, y, x, k){
  form <- paste0(y, ' ~ poly(', x, ', ', i, ', raw = TRUE)')
  glm.out <- glm(as.formula(form), data = data)
  cv.glm(data, glm.out, K = k)$delta[1]
}
degrees <- map_dbl(1:10, cv.poly, data = Boston, y = 'nox', x = 'dis', k = 10)
which.min(degrees)
## [1] 3
Next, we use the bs() function to fit a regression spline to predict nox using dis, reporting the output for a fit with four degrees of freedom and noting how the knots are chosen.
bs.out <- lm(nox ~ bs(dis, df = 4), data = Boston)
summary(bs.out)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124622 -0.039259 -0.008514 0.020850 0.193891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73447 0.01460 50.306 < 2e-16 ***
## bs(dis, df = 4)1 -0.05810 0.02186 -2.658 0.00812 **
## bs(dis, df = 4)2 -0.46356 0.02366 -19.596 < 2e-16 ***
## bs(dis, df = 4)3 -0.19979 0.04311 -4.634 4.58e-06 ***
## bs(dis, df = 4)4 -0.38881 0.04551 -8.544 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06195 on 501 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7142
## F-statistic: 316.5 on 4 and 501 DF, p-value: < 2.2e-16
With df = 4, bs() places a single interior knot at the median of the data:
attr(bs(Boston$dis, df = 4), 'knots')
## 50%
## 3.20745
p <- ggplot(Boston, aes(y = nox, x = dis))
p + geom_point() + geom_smooth(method = 'lm', formula = y ~ bs(x, df = 4))
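The map() call below relies on a helper spline_plots() that is not shown in this excerpt. A minimal sketch, assuming it fits lm(y ~ bs(x, df = i)), optionally plots the fit, and returns the residual sum of squares:

```r
# Hypothetical reconstruction of the helper used below: fit a regression
# spline with df = i and return the residual sum of squares.
library(MASS)     # Boston data
library(splines)  # bs()

spline_plots <- function(i, data, y, x, plots = TRUE){
  form <- as.formula(paste0(y, ' ~ bs(', x, ', df = ', i, ')'))
  fit <- lm(form, data = data)
  if (plots) {
    plot(data[[x]], data[[y]], xlab = x, ylab = y)
    ord <- order(data[[x]])
    lines(data[[x]][ord], fitted(fit)[ord], col = 'red')
  }
  sum(resid(fit)^2)
}
```

For i = 4 this returns the RSS of the df = 4 spline fit reported above (about 1.923).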
map(4:20, spline_plots, data = Boston, y = 'nox', x = 'dis', plots = FALSE)
## [[1]]
## [1] 1.922775
##
## [[2]]
## [1] 1.840173
##
## [[3]]
## [1] 1.833966
##
## [[4]]
## [1] 1.829884
##
## [[5]]
## [1] 1.816995
##
## [[6]]
## [1] 1.825653
##
## [[7]]
## [1] 1.792535
##
## [[8]]
## [1] 1.796992
##
## [[9]]
## [1] 1.788999
##
## [[10]]
## [1] 1.78235
##
## [[11]]
## [1] 1.781838
##
## [[12]]
## [1] 1.782798
##
## [[13]]
## [1] 1.783546
##
## [[14]]
## [1] 1.779789
##
## [[15]]
## [1] 1.775838
##
## [[16]]
## [1] 1.774487
##
## [[17]]
## [1] 1.776727
set.seed(322)
cv.bs <- function(i, data, y, x, k){
  formula1 <- paste0(y, ' ~ bs(', x, ', df = ', i, ')')
  glm.out <- glm(as.formula(formula1), data = data)
  cv.glm(data, glm.out, K = k)$delta[1]
}
dof <- map_dbl(4:20, cv.bs, data = Boston, y = 'nox', x = 'dis', k = 10)
which.min(dof)
## [1] 7
The minimum is at position 7 of the candidate range 4:20, i.e. df = 10.
We now split the College data into a training set and a test set, and use forward stepwise selection on the training set to identify a satisfactory model for out-of-state tuition (Outstate) that uses just a subset of the predictors.
train <- sample(nrow(College), nrow(College) / 2)
College.train <- College[train, ]
subset.out <- regsubsets(Outstate ~ ., data = College.train, method = 'forward' )
which.min(summary(subset.out)$bic)
## [1] 7
included <- names(coef(subset.out, which.min(summary(subset.out)$bic)))
included
## [1] "(Intercept)" "PrivateYes" "Room.Board" "Personal" "Terminal"
## [6] "perc.alumni" "Expend" "Grad.Rate"
gam.out <- gam(Outstate ~ Private + Room.Board + Personal + Terminal + perc.alumni + Expend + Grad.Rate, data = College.train)
summary(gam.out)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Outstate ~ Private + Room.Board + Personal + Terminal + perc.alumni +
## Expend + Grad.Rate
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.017e+03 7.997e+02 -3.773 0.000187 ***
## PrivateYes 2.322e+03 3.275e+02 7.088 6.65e-12 ***
## Room.Board 8.887e-01 1.247e-01 7.127 5.21e-12 ***
## Personal -5.419e-01 1.852e-01 -2.926 0.003639 **
## Terminal 4.528e+01 8.656e+00 5.231 2.79e-07 ***
## perc.alumni 3.828e+01 1.119e+01 3.420 0.000693 ***
## Expend 2.385e-01 2.507e-02 9.512 < 2e-16 ***
## Grad.Rate 2.867e+01 7.526e+00 3.809 0.000163 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.729 Deviance explained = 73.4%
## GCV = 4.4169e+06 Scale est. = 4.3258e+06 n = 388
plots <- visreg(gam.out, plot = FALSE)
df <- ldply(plots, function(part)
  data.frame(variable = part$meta$x,
             x = part$fit[[part$meta$x]],
             smooth = part$fit$visregFit,
             lower = part$fit$visregLwr,
             upper = part$fit$visregUpr))
df %>% dplyr::filter(variable != 'Private') %>%
  mutate_at(vars(x:upper), as.numeric) %>%
  ggplot(aes(x, smooth)) + geom_line() +
  geom_line(aes(y = lower), linetype = "dashed") +
  geom_line(aes(y = upper), linetype = "dashed") +
  facet_grid(. ~ variable, scales = "free_x")
preds <- predict(gam.out, newdata = College[-train, ])
mean((preds - College[-train, 'Outstate'])^2)
## [1] 3928241
None of the partial-effect plots suggests a strong nonlinear relationship between these predictors and Outstate.
Suppose that we would like to perform multiple linear regression, but we do not have software to do so. Instead, we only have software to perform simple linear regression. Therefore, we take the following iterative approach: we repeatedly hold all but one coefficient estimate fixed at its current value, and update only that coefficient estimate using a simple linear regression. The process is continued until convergence—that is, until the coefficient estimates stop changing.
We now try this out on a toy example.
set.seed(123)
y <- rnorm(100)
x1 <- rnorm(100)
x2 <- rnorm(100)
beta1 <- 1
Keeping beta1 fixed, we fit the model \[ Y - \hat{\beta}_1 X_1 = \beta_0 + \beta_2 X_2 + \epsilon . \] You can do this as follows:
a=y-beta1*x1
beta2=lm(a ~ x2)$coef[2]
Next, keeping beta2 fixed, we fit the model \[ Y - \hat{\beta}_2 X_2 = \beta_0 + \beta_1 X_1 + \epsilon . \]
You can do this as follows:
a = y-beta2*x2
beta1 = lm(a ~ x1)$coef[2]
iterate <- function(){
  a <- y - beta1 * x1
  beta2 <<- lm(a ~ x2)$coef[2]
  a <- y - beta2 * x2
  beta1 <<- lm(a ~ x1)$coef[2]
  c(lm(a ~ x1)$coef[1], beta1, beta2)
}
beta1 <- 1
mat <- replicate(1000, iterate())
df <- as.data.frame(t(mat))
names(df) <- c('beta0', 'beta1', 'beta2')
ggplot(df, aes(x = seq_along(beta0))) + geom_line(aes(y=beta0, color = 'beta0')) +
geom_line(aes(y=beta1, color = 'beta1')) +
geom_line(aes(y=beta2, color = 'beta2'))
Finally, we compare these estimates to the coefficients from performing multiple linear regression directly; the abline() function could be used to overlay them on a base-graphics version of the plot obtained above.
lm.out <- lm(y ~ x1 + x2)
coef(lm.out)
## (Intercept) x1 x2
## 0.10056651 -0.04306883 -0.12279500
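Since the trace plot above was drawn with ggplot rather than base graphics, the multiple regression estimates are more naturally overlaid with geom_hline() than with abline(). A sketch, wrapped as a hypothetical helper overlay_coefs() that takes the data frame of backfitting iterates and the fitted lm object built above:

```r
library(ggplot2)

# Plot the backfitting trace with the multiple-regression coefficients
# overlaid as dotted horizontal reference lines. `df` is assumed to have
# columns beta0, beta1, beta2 (one row per iteration), as built above.
overlay_coefs <- function(df, lm.out){
  ggplot(df, aes(x = seq_along(beta0))) +
    geom_line(aes(y = beta0, color = 'beta0')) +
    geom_line(aes(y = beta1, color = 'beta1')) +
    geom_line(aes(y = beta2, color = 'beta2')) +
    geom_hline(yintercept = coef(lm.out), linetype = 3) +
    labs(x = 'iteration', y = 'estimate', color = NULL)
}
```

Calling overlay_coefs(df, lm.out) reproduces the trace plot with the least-squares estimates marked.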
The backfitting estimates converge to the multiple linear regression coefficients in essentially one pass.